Contents

# load packages
suppressPackageStartupMessages({
    library(ComplexHeatmap)
    library(cowplot)
    library(ddSingleCell)
    library(dplyr)
    library(limma)
    library(scater)
})

1 Data description

We provide an examplary SingleCellExperiment (SCE) generated from a subset of data from (Kang et al. 2018). The original dataset contains 10x droplet-based scRNA-seq PBCM data from 8 Lupus patients obtained befor and after 6h-treatment with INF-beta. The complete raw data, as well as gene and cell metadata is available through the NCBI GEO, accession number GSE96583.

For size and runtime reasons, the data has been filtered to

The resulting SCE contains ~700 genes and ~6500 cells. Assay logcounts corresponds to log-normalized values obtained from normalize (package scater) with default parameters (library size normalization).

# load data
data(sce)
sce
## class: SingleCellExperiment 
## dim: 672 6456 
## metadata(2): experiment_info log.exprs.offset
## assays(2): counts logcounts
## rownames(672): ENSG00000187608 ENSG00000175756 ... ENSG00000160201
##   ENSG00000160213
## rowData names(1): feature
## colnames(6456): CGGGCATGCAGAAA-1 ATACTCTGTGACTG-1 ...
##   TTTCAGTGCGAGAG-1 TTTCTACTTCAGAC-1
## colData names(3): sample_id group_id cluster_id
## reducedDimNames(1): tsne
## spikeNames(0):

2 Data overview

As we will be aggregating measurements at the cluster-sample level, it is of particular importance to check the number of cells captured for each such instance. While aggregateData (see Section 3) allows excluding cluster-sample combinations with less than a threshold number of cells, clusters or samples with overall very low cell-counts may be excluded from further analysis at this point already.

For the Kang dataset, for example, we might consider removing the Dendritic cells and Megakaryocytes clusters, as these containg less than 50 cells across all samples.

# nb. of cells per cluster-sample
table(sce$cluster_id, sce$sample_id)
##                    
##                     ctrl1015 ctrl1256 ctrl1488 stim1015 stim1256 stim1488
##   B cells                200      200      200      200      200      200
##   CD14+ Monocytes        200      200      200      200      200      200
##   CD4 T cells            200      200      200      200      200      200
##   CD8 T cells            179      131       58      133      112       43
##   Dendritic cells         30       26       43       30       32       54
##   FCGR3A+ Monocytes      200       69      101      200       97      144
##   Megakaryocytes          20       14       18       20       17       29
##   NK cells               177      200      116      195      200      168

2.1 Dimension reduction

The dimensions reductions (DR) available within the SCE can be viewed via reducedDims from the scater package, and visualized using plotReducedDim. For our dataset, the t-SNE colored by cluster_ids (Fig. 1A) shows that cell-populations are well-separated from one another. INF-beta stimulation manifests as a severe shift in the t-SNE projection of cells when coloring by group_ids (Fig. 1B), indicating widespread, genome-scale transcriptiontal changes.

# t-SNE colored by cluster_id & group_id
ps_tsne <- lapply(c("cluster_id", "group_id"), function(x) 
    plotReducedDim(sce, use_dimred = "tsne", colour_by = x, 
        add_ticks = FALSE) + theme(aspect.ratio = 1))
plot_grid(plotlist = ps_tsne, align = "vh", labels = c("A", "B"))
t-SNE. Cells are colored by cluster ID (A) and group ID (B), respectively.

Figure 1: t-SNE
Cells are colored by cluster ID (A) and group ID (B), respectively.

3 Aggregating single-cell to pseudo-bulk data

aggregateData() will aggregate single-cell measurements by the colData variables specified with argument by, and return a SingleCellExperiment containing pseudo-bulk data. For DE analysis at the cluster-level, measurements must be aggregated at the cluster-sample level (default by = c("cluster_id", "sample_id"). In this case, the returned SingleCellExperiment will contain one assay per cluster, each with dimensions number of genes times number of samples. Argument assay and fun specify the input data and summary statistic, respectively, to use for aggregation.

pb <- aggregateData(
    x = sce, assay = "counts", fun = "sum",
    by = c("cluster_id", "sample_id"))
# one sheet per cluster
assayNames(pb)
## [1] "B cells"           "CD14+ Monocytes"   "CD4 T cells"      
## [4] "CD8 T cells"       "Dendritic cells"   "FCGR3A+ Monocytes"
## [7] "Megakaryocytes"    "NK cells"
# PBs for 1st cluster
head(assay(pb))
##                 ctrl1015 ctrl1256 ctrl1488 stim1015 stim1256 stim1488
## ENSG00000187608       60       54       36     2422     2209     2312
## ENSG00000175756       58       61       68       52       52       58
## ENSG00000157916       34       44       38       46       39       45
## ENSG00000116251      111       89      105       68       61       72
## ENSG00000116288      138      169      189      117      148      139
## ENSG00000074800      210      159      201      134      163      176

4 Pseudo-bulk level MDS plot

Prior to conducting any formal testing, we can compute a multi-dimensional scaling (MDS) plot of aggregated signal to explore overall sample similarities.

pbMDS takes as input any SCE containg PB data as returned by aggregateData, and computes MDS dimensions using edgeR. Ideally, such a represenation of the data should separate both clusters and groups from one-another, while sample from the same cluster and groups should be close to one another.

In our MDS plot on pseudo-bulk counts (Fig. 2), we can observe that the first dimension (MDS1) clearly separates cell-populations (clusters), while the second (MDS2) separates control and stimulated samples (groups). Furthermore, the two T-cell clusters fall close to each other.

pbMDS(pb)
Pseudo-bulk level MDS plot. Points represent cluster-sample instances, are colored by cluster ID, and shaped by group IDs.

Figure 2: Pseudo-bulk level MDS plot
Points represent cluster-sample instances, are colored by cluster ID, and shaped by group IDs.

5 Cluster-level DE analysis

Once we have assembled the pseudo-bulk data, we can test for cluster-level DE using runDS. We specify a design matrix capturing the experimental design using model.matrix (package stats), and a contrast matrrix that specifies our comparison of interesting using makeContrasts from the limma package. Alternatively, the comparison of interest (or a list thereof) can be specified with via coefs (see ?glmQLFTest for details).

For the Kang dataset, we want to carry out a single comparison of stimulated against control samples, thus placing "ctrl" on the right-hand side as the reference condition.

# construct design & contrast matrix
ei <- metadata(sce)$experiment_info
design <- model.matrix(~ 0 + ei$group_id)
dimnames(design) <- list(ei$sample_id, levels(ei$group_id))
contrast <- makeContrasts("stim-ctrl", levels = design)

# run DE analysis
res <- runDS(sce, pb, design, contrast, method = "edgeR", verbose = FALSE)
# access results
tbl <- res$table$`stim-ctrl`
# one data.frame per cluster
names(tbl)
## [1] "B cells"           "CD14+ Monocytes"   "CD4 T cells"      
## [4] "CD8 T cells"       "Dendritic cells"   "FCGR3A+ Monocytes"
## [7] "Megakaryocytes"    "NK cells"
# view results for 1st cluster
head(tbl[[1]])
##              gene cluster_id      logFC    logCPM            F        p_val
## 1 ENSG00000187608    B cells  5.5440098 12.669406 1231.9513592 1.991223e-13
## 2 ENSG00000175756    B cells -0.1918794  8.359226    1.8099927 2.122401e-01
## 3 ENSG00000157916    B cells  0.1785239  7.876081    0.8685524 3.697983e-01
## 4 ENSG00000116251    B cells -0.5865751  8.880400   15.8540164 1.834226e-03
## 5 ENSG00000116288    B cells -0.2784390  9.695360    3.6882618 7.896894e-02
## 6 ENSG00000074800    B cells -0.2571509  9.905444    3.4526918 8.793931e-02
##          p_adj  contrast
## 1 4.460338e-11 stim-ctrl
## 2 3.547894e-01 stim-ctrl
## 3 5.271614e-01 stim-ctrl
## 4 7.250587e-03 stim-ctrl
## 5 1.676411e-01 stim-ctrl
## 6 1.840972e-01 stim-ctrl

5.1 Results filtering & overview

To get a general overview of the differential testing results, we first filter them to retain hits FDR < 5% and abs(logFC) > 1, and count the number and frequency of differential findings by cluster. Finally, we can view the top hits (lowest adj. p-value) in each cluster.

# filter FDR < 5%, abs(logFC) > 1
tbl_fil <- lapply(tbl, function(u)
    dplyr::filter(u, p_adj < 0.05, abs(logFC) > 1))

# nb. of DE genes & % of total by cluster
n_de <- vapply(tbl_fil, nrow, numeric(1))
p_de <- format(n_de / nrow(sce) * 100, digits = 3)
data.frame("#DE" = n_de, "%DE" = p_de, check.names = FALSE)
##                   #DE   %DE
## B cells           110 16.37
## CD14+ Monocytes   189 28.12
## CD4 T cells        78 11.61
## CD8 T cells        77 11.46
## Dendritic cells   138 20.54
## FCGR3A+ Monocytes 137 20.39
## Megakaryocytes     40  5.95
## NK cells           92 13.69
# view top 2 hits in each cluster
do.call("rbind", lapply(tbl_fil, function(u) 
    dplyr::arrange(u, p_adj)[1:2, ]))
##                                gene        cluster_id     logFC    logCPM
## B cells.1           ENSG00000172183           B cells  3.239742 12.674136
## B cells.2           ENSG00000187608           B cells  5.544010 12.669406
## CD14+ Monocytes.1   ENSG00000172183   CD14+ Monocytes  6.058700 12.485934
## CD14+ Monocytes.2   ENSG00000102524   CD14+ Monocytes  6.012809 10.101459
## CD4 T cells.1       ENSG00000121858       CD4 T cells  3.934817  9.338110
## CD4 T cells.2       ENSG00000177409       CD4 T cells  4.141839  7.913198
## CD8 T cells.1       ENSG00000126709       CD8 T cells  5.345602 11.464966
## CD8 T cells.2       ENSG00000187608       CD8 T cells  4.691455 12.614579
## Dendritic cells.1   ENSG00000187608   Dendritic cells  6.108153 13.992774
## Dendritic cells.2   ENSG00000169245   Dendritic cells 10.205696 13.247729
## FCGR3A+ Monocytes.1 ENSG00000121858 FCGR3A+ Monocytes  4.024035 11.962174
## FCGR3A+ Monocytes.2 ENSG00000204389 FCGR3A+ Monocytes  3.089431  9.834430
## Megakaryocytes.1    ENSG00000172183    Megakaryocytes  3.477222 11.970324
## Megakaryocytes.2    ENSG00000119917    Megakaryocytes  6.888600 10.679613
## NK cells.1          ENSG00000187608          NK cells  4.349408 13.094877
## NK cells.2          ENSG00000172183          NK cells  3.576383 12.541496
##                              F        p_val        p_adj  contrast
## B cells.1           1545.71123 5.186932e-14 3.485618e-11 stim-ctrl
## B cells.2           1231.95136 1.991223e-13 4.460338e-11 stim-ctrl
## CD14+ Monocytes.1   1908.74907 5.951589e-11 3.999468e-08 stim-ctrl
## CD14+ Monocytes.2   1161.14911 4.460946e-10 1.498878e-07 stim-ctrl
## CD4 T cells.1        393.59382 3.553975e-09 2.388271e-06 stim-ctrl
## CD4 T cells.2        253.57367 2.826916e-08 2.599262e-06 stim-ctrl
## CD8 T cells.1        782.72158 4.715279e-12 3.168667e-09 stim-ctrl
## CD8 T cells.2        656.94645 1.289623e-11 4.333132e-09 stim-ctrl
## Dendritic cells.1    898.82092 3.146701e-11 1.969364e-08 stim-ctrl
## Dendritic cells.2    793.89006 5.861201e-11 1.969364e-08 stim-ctrl
## FCGR3A+ Monocytes.1  683.86450 2.831331e-09 1.902654e-06 stim-ctrl
## FCGR3A+ Monocytes.2  541.77665 7.363511e-09 2.474140e-06 stim-ctrl
## Megakaryocytes.1     103.29120 1.677068e-06 1.126990e-03 stim-ctrl
## Megakaryocytes.2      72.71286 7.917890e-06 2.660411e-03 stim-ctrl
## NK cells.1           817.05756 1.423041e-10 9.562837e-08 stim-ctrl
## NK cells.2           580.81496 7.111416e-10 2.389436e-07 stim-ctrl

5.2 Calculating expression frequencies

In is often worthwhile to filter DE results based on the expression frequencies of each gene, that is, the fraction of cells that express it. calcExprFreqs provides a flexible way of computing cluster-sample/-group wise expression frequencies. Here, a gene is considered to be expressed when the specified measurement value (argument assay) falls above a certain threshold (argument th). Note that, assay = "counts" and th = 0 (default) amounts to the fraction of cells for which a respective gene has been detected.

calcExprFreqs will return a SummarizedExperiment object, where sheets (assays) = clusters, rows = genes, and columns = samples (and groups, if group_ids are present in the colData of the input SCE).

frq <- calcExprFreqs(sce, assay = "counts", th = 0)
# one sheet per cluster
assayNames(frq)
## [1] "B cells"           "CD14+ Monocytes"   "CD4 T cells"      
## [4] "CD8 T cells"       "Dendritic cells"   "FCGR3A+ Monocytes"
## [7] "Megakaryocytes"    "NK cells"
# expr. freqs. for 1st cluster
head(assay(frq))
##                 ctrl1015 ctrl1256 ctrl1488 stim1015 stim1256 stim1488
## ENSG00000187608    0.180    0.185    0.135    1.000    0.995    1.000
## ENSG00000175756    0.245    0.230    0.225    0.240    0.225    0.240
## ENSG00000157916    0.150    0.180    0.170    0.190    0.170    0.205
## ENSG00000116251    0.420    0.325    0.365    0.255    0.275    0.300
## ENSG00000116288    0.405    0.490    0.470    0.380    0.460    0.385
## ENSG00000074800    0.480    0.355    0.435    0.325    0.400    0.365
##                      ctrl      stim
## ENSG00000187608 0.1666667 0.9983333
## ENSG00000175756 0.2333333 0.2350000
## ENSG00000157916 0.1666667 0.1883333
## ENSG00000116251 0.3700000 0.2766667
## ENSG00000116288 0.4550000 0.4083333
## ENSG00000074800 0.4233333 0.3633333

5.3 Formatting results

Especially when testing multiple contrasts or coefficients, the results returned by runDS may become very complex and unhandy for exploration or exporting. Results can be formatted using resDS, which provides two alternative modes for formatting: bind = "row"/"col".

When bind = "row", results from all comparisons will be merged vertically (analogouse to do.call("rbind", ...)) into a tidy format table, with column contrast/coef specifying the comparison.

Otherwise, bind = "col", results will be merge horizontally into a single wide table where all results for a given gene and cluster are kept in one row. An identifier of the respective contrast of coefficient is then appended to the column names. This format is useful when wanting to view a specific gene’s behavior across, for example, multiple treatments, but will become messy when many comparisons are included.

Expression frequencies computed with calcExprFreqs, as well as cluster-sample level avg. CPM, can be included in the results by setting frq/cpm = TRUE. Alternatively, if the former have been pre-computed, they can be supplied directly as an input to resDS (see example below).

tbl_big <- resDS(sce, res, bind = "row", frq = frq, cpm = FALSE)
tbl_tdy <- resDS(sce, res, bind = "col", frq = frq, cpm = FALSE)

6 Visualizing results

We first generate a set of t-SNEs colored by gene expression for the top DE gene in each cluster. To facilitate matching the affected cells to their cluster and experimental group, we also add the t-SNEs colored by cluster and group ID, respectively, that are plotted above (Fig. 1).

# get cluster IDs
cluster_ids <- levels(sce$cluster_id)
names(cluster_ids) <- cluster_ids

ps <- lapply(cluster_ids, function(k) {
    u <- dplyr::arrange(tbl[[k]], p_adj) # sort by adj. p-value
    g <- u$gene[1]                       # get top hit
    plotReducedDim(sce, "tsne", colour_by = g, add_ticks = FALSE) + 
        ggtitle(sprintf("%s(%s)", g, k)) + theme_void() + 
        theme(aspect.ratio = 1, legend.position = "none",
            plot.title = element_text(size = 8))
})
row1 <- plot_grid(plotlist = ps_tsne, ncol = 2, align = "vh")
row2 <- plot_grid(plotlist = ps, ncol = 4, align = "vh")
plot_grid(row1, row2, ncol = 1, rel_heights = c(2, 3))

For changes of high interest, we can view the cell-level expression profiles of a specific gene across samples or groups using plotExpression (package scater).

Here, we generate violins plot for the top DE genes (lowest adj. p-value) in the first three clusters (Fig. 3). Note that, as we are testing for DE on the cluster-level, we need to subset the cells that have been assigned to a given cluster for plotting.

# generate violins for top hits by cluster
ps <- lapply(cluster_ids[seq_len(3)], function(k) {
    u <- dplyr::arrange(tbl[[k]], p_adj)       # sort by adj. p-value
    gs <- u$gene[seq_len(5)]                   # get top hits
    plotExpression(sce[, sce$cluster_id == k], # subset this cluster
        features = gs, x = "sample_id", colour_by = "group_id", ncol = 5) + 
        ggtitle(k) + theme(axis.text.x = element_text(angle = 30, hjust = 1))
})
plot_grid(plotlist = ps, ncol = 1)
Violin plots. Show are the top 5 hits (lowest adj. p-value) for the 1st 3 clustes.

Figure 3: Violin plots
Show are the top 5 hits (lowest adj. p-value) for the 1st 3 clustes.

Especially when wanting to gain an overview of numerous DE testing results for many clusters, bothm dimension reduction and cell-level visualisations require a lot of space can become cumbersome to interpret. In this setting, it is thus recommendable to visualise aggregated measures, e.g., mean expressions by cluster sample. We can use aggregateData to assemble cluster-sample level mean expression values for all genes, and visualize any hits of interest.

# compute cluster-sample expression means
ms <- aggregateData(sce, assay = "logcounts", fun = "mean")
ms <- data.frame(
    row.names = NULL, gene = rownames(ms),
    cluster_id = rep(assayNames(ms), each = nrow(sce)),
    do.call("rbind", as.list(assays(ms))))
head(ms)
##              gene cluster_id  ctrl1015  ctrl1256  ctrl1488  stim1015
## 1 ENSG00000187608    B cells 0.2705948 0.2658873 0.1619929 3.7980880
## 2 ENSG00000175756    B cells 0.3039544 0.2996701 0.2871825 0.2800285
## 3 ENSG00000157916    B cells 0.1686634 0.2385715 0.2061330 0.2350590
## 4 ENSG00000116251    B cells 0.5670806 0.4716987 0.5039134 0.3482990
## 5 ENSG00000116288    B cells 0.6018059 0.7405015 0.7346284 0.5432610
## 6 ENSG00000074800    B cells 0.7462413 0.5688830 0.6882017 0.5042638
##    stim1256  stim1488
## 1 3.6442352 3.6619709
## 2 0.2806413 0.2786930
## 3 0.1996890 0.2322464
## 4 0.3455977 0.3706336
## 5 0.6794385 0.5502087
## 6 0.6264986 0.5530005
n <- 3 # nb. of genes to plot per cluster
top <- lapply(cluster_ids, function(k) {
    u <- dplyr::arrange(tbl[[k]], p_adj)
    u <- u[seq_len(n), ]
    ms %>% dplyr::filter(gene %in% u$gene & cluster_id %in% u$cluster_id)
})
# assemble means from all clusters
top <- do.call("rbind", top)
# set rownames & remove un-needed columns
rownames(top) <- with(top, sprintf("%s(%s)", gene, cluster_id))
top <- select(top, -c("gene", "cluster_id"))
# plot heatmap of cluster-sample expression means
Heatmap(top,
    cluster_rows = FALSE,
    cluster_columns = FALSE)

plotDiffGenes

# plot top results for a single cluster
plotDiffGenes(sce, res, k = "B cells")

# plot single gene across all clusters
plotDiffGenes(sce, res, g = rownames(sce)[1])

7 Session info

sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS  10.14.3
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
##  [1] parallel  stats4    grid      stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] scater_1.10.1               SingleCellExperiment_1.4.1 
##  [3] SummarizedExperiment_1.12.0 DelayedArray_0.8.0         
##  [5] BiocParallel_1.16.6         matrixStats_0.54.0         
##  [7] Biobase_2.42.0              GenomicRanges_1.34.0       
##  [9] GenomeInfoDb_1.18.2         IRanges_2.16.0             
## [11] S4Vectors_0.20.1            BiocGenerics_0.28.0        
## [13] limma_3.38.3                dplyr_0.8.0.1              
## [15] ddSingleCell_0.99.7         cowplot_0.9.4              
## [17] ggplot2_3.1.0               ComplexHeatmap_1.20.0      
## [19] BiocStyle_2.10.0           
## 
## loaded via a namespace (and not attached):
##  [1] viridis_0.5.1            edgeR_3.24.3            
##  [3] splines_3.5.1            viridisLite_0.3.0       
##  [5] DelayedMatrixStats_1.4.0 assertthat_0.2.0        
##  [7] highr_0.7                BiocManager_1.30.4      
##  [9] vipor_0.4.5              GenomeInfoDbData_1.2.0  
## [11] yaml_2.2.0               pillar_1.3.1            
## [13] lattice_0.20-38          glue_1.3.0              
## [15] digest_0.6.18            RColorBrewer_1.1-2      
## [17] XVector_0.22.0           colorspace_1.4-0        
## [19] htmltools_0.3.6          Matrix_1.2-15           
## [21] plyr_1.8.4               pkgconfig_2.0.2         
## [23] GetoptLong_0.1.7         bookdown_0.9            
## [25] zlibbioc_1.28.0          purrr_0.3.0             
## [27] scales_1.0.0             HDF5Array_1.10.1        
## [29] tibble_2.0.1             withr_2.1.2             
## [31] lazyeval_0.2.1           magrittr_1.5            
## [33] crayon_1.3.4             evaluate_0.13           
## [35] MAST_1.8.2               beeswarm_0.2.3          
## [37] tools_3.5.1              data.table_1.12.0       
## [39] GlobalOptions_0.1.0      stringr_1.4.0           
## [41] Rhdf5lib_1.4.2           munsell_0.5.0           
## [43] locfit_1.5-9.1           compiler_3.5.1          
## [45] rlang_0.3.1              rhdf5_2.26.2            
## [47] RCurl_1.95-4.11          rjson_0.2.20            
## [49] circlize_0.4.5           labeling_0.3            
## [51] bitops_1.0-6             rmarkdown_1.11          
## [53] gtable_0.2.0             abind_1.4-5             
## [55] reshape2_1.4.3           R6_2.4.0                
## [57] gridExtra_2.3            knitr_1.21              
## [59] shape_1.4.4              ggbeeswarm_0.6.0        
## [61] stringi_1.3.1            Rcpp_1.0.0              
## [63] tidyselect_0.2.5         xfun_0.4

References

Kang, Hyun Min, Meena Subramaniam, Sasha Targ, Michelle Nguyen, Lenka Maliskova, Elizabeth McCarthy, Eunice Wan, et al. 2018. “Multiplexed Droplet Single-Cell Rna-Sequencing Using Natural Genetic Variation.” Nat Biotechnol 36 (1): 89–94. https://doi.org/10.1038/nbt.4042.